Deep learning-based identification of eyes at risk for glaucoma surgery

To develop and evaluate the performance of a deep learning model (DLM) that predicts eyes at high risk of surgical intervention for uncontrolled glaucoma based on multimodal data from an initial ophthalmology visit. Longitudinal, observational, retrospective study. 4898 unique eyes from 4038 adult glaucoma or glaucoma-suspect patients who underwent surgery for uncontrolled glaucoma (trabeculectomy, tube shunt, xen, or diode surgery) between 2013 and 2021, or did not undergo glaucoma surgery but had 3 or more ophthalmology visits. We constructed a DLM to predict the occurrence of glaucoma surgery within various time horizons from a baseline visit. Model inputs included spatially oriented visual field (VF) and optical coherence tomography (OCT) data as well as clinical and demographic features. Separate DLMs with the same architecture were trained to predict the occurrence of surgery within 3 months, within 3–6 months, within 6 months–1 year, within 1–2 years, within 2–3 years, within 3–4 years, and within 4–5 years from the baseline visit. Included eyes were randomly split into 60%, 20%, and 20% for training, validation, and testing. DLM performance was measured using area under the receiver operating characteristic curve (AUC) and precision-recall curve (PRC). Shapley additive explanations (SHAP) were utilized to assess the importance of different features. Model prediction of surgery for uncontrolled glaucoma within 3 months had the best AUC of 0.92 (95% CI 0.88, 0.96). DLMs achieved clinically useful AUC values (> 0.8) for all models that predicted the occurrence of surgery within 3 years. According to SHAP analysis, all 7 models placed intraocular pressure (IOP) within the five most important features in predicting the occurrence of glaucoma surgery. Mean deviation (MD) and average retinal nerve fiber layer (RNFL) thickness were listed among the top 5 most important features by 6 of the 7 models. 
DLMs can successfully identify eyes requiring surgery for uncontrolled glaucoma within specific time horizons. Predictive performance decreases as the time horizon for forecasting surgery increases. Implementing prediction models in a clinical setting may help identify patients that should be referred to a glaucoma specialist for surgical evaluation.


Data collection
This is a retrospective longitudinal study of glaucoma patients followed at the Wilmer Eye Institute between 2013 and 2021. We included eyes with at least one set of baseline reliable VF data, reliable OCT data, clinical data (visual acuity, IOP), and demographic data (age, gender, and race) from the same visit. VF testing was done with the Humphrey Field Analyzer using the SITA Standard/Fast/Faster test strategy and 24-2 test pattern. OCT data were obtained with CIRRUS HD-OCT (Zeiss, Dublin, CA). Data were extracted from EPIC (Epic Systems, Madison, WI) and FORUM (Zeiss, Dublin, CA).
Previously published criteria 11 were used to define reliable VF tests: less than 15% false positives and less than 25% false negatives for mild/moderate glaucoma (MD > −12 dB); less than 15% false positives and less than 50% false negatives for severe glaucoma (MD ≤ −12 dB). Reliability criteria for OCT consisted of having a signal strength of 6 or greater, and greater than 30 μm for average and superior/inferior quadrant retinal nerve fiber layer (RNFL) thickness. We set the criterion for RNFL thickness at 30 μm to account for eyes with artifacts (i.e., segmentation errors) that would cause RNFL thickness to drop well below the measurement floor of approximately 57 μm on Cirrus OCT 16,17.
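The reliability criteria above can be written as two small predicates. This is a minimal sketch, not the authors' code: the function and parameter names are hypothetical, and percentages are assumed to be expressed on a 0-100 scale.

```python
def vf_is_reliable(md_db, false_pos_pct, false_neg_pct):
    """Apply the VF reliability criteria quoted in the text.

    Mild/moderate glaucoma (MD > -12 dB): FP < 15% and FN < 25%.
    Severe glaucoma (MD <= -12 dB):       FP < 15% and FN < 50%.
    """
    fn_limit = 25.0 if md_db > -12.0 else 50.0
    return false_pos_pct < 15.0 and false_neg_pct < fn_limit


def oct_is_reliable(signal_strength, avg_rnfl_um, superior_rnfl_um, inferior_rnfl_um):
    """Apply the OCT reliability criteria quoted in the text.

    Signal strength must be 6 or greater, and the average, superior, and
    inferior RNFL thickness values must each exceed 30 micrometers.
    """
    thinnest = min(avg_rnfl_um, superior_rnfl_um, inferior_rnfl_um)
    return signal_strength >= 6 and thinnest > 30.0
```

Keeping the false-negative limit as a function of MD makes the severity-dependent rule explicit in one place.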
Included eyes were randomly selected at the patient level, which means that if a patient had multiple VF/OCT/clinical test records for the same eye or for both eyes within the same time interval, we randomly selected one record and excluded the others. Inclusion at the patient level was deemed more appropriate because ignoring within-subject correlations may result in overestimating model performance on the test set.
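Patient-level selection of this kind can be sketched as follows. The record layout (dictionaries with `patient_id` and `horizon` keys), the function name, and the seed are assumptions made for illustration; the source does not describe its data structures.

```python
import random
from collections import defaultdict


def one_record_per_patient(records, seed=0):
    """Keep one randomly chosen record per (patient, time interval).

    `records` is a list of dicts with hypothetical keys "patient_id" and
    "horizon". Records for either eye of the same patient compete for
    the same slot, so no patient contributes correlated records to one
    time interval.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["patient_id"], rec["horizon"])].append(rec)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return [rng.choice(grouped[key]) for key in sorted(grouped)]
```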

Defining time horizons and labeling eyes
We trained separate DLMs to predict eyes at high risk for future surgery for 7 different time horizons after the first VF/OCT/clinical (baseline) visit: within 3 months, within 3-6 months, within 6 months-1 year, within 1-2 years, within 2-3 years, within 3-4 years, and within 4-5 years. Separate DLMs were trained instead of a single DLM to maximize predictive power. Eyes were labeled as having surgery if they underwent trabeculectomy, tube shunt, xen, or diode surgery (procedures with CPT codes 66170, 66172, 66180, 66179, 66183, or 0449T) within the specified time horizon. While there are a variety of glaucoma procedures available to control IOP, these are the procedures that were most often performed for uncontrolled glaucoma among glaucoma practitioners at the Wilmer Eye Institute during the study period. Angle-based procedures and other less invasive procedures are often done in conjunction with phacoemulsification in medically controlled glaucoma and do not generally denote uncontrolled glaucoma in our practice. Therefore, such procedures were not included in this study, as the goal was to identify high-risk/uncontrolled eyes. Nonsurgical eyes were defined as eyes of glaucoma or glaucoma-suspect patients who did not undergo glaucoma surgery.
Patients included in this study were required to have their first VF, OCT, and clinical (baseline) ophthalmology visits on the same date. For surgical patients, the time interval between baseline visit and surgery was required to be within one of the time horizons (e.g., within 3 months, 3 to 6 months, etc.). For non-surgical patients, the time interval between the baseline visit and the second ophthalmology visit was required to be within one of the time horizons. Additionally, nonsurgical patients were required to have a follow-up visit after the specified time horizon.
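As an illustration of how an interval from baseline maps onto the seven horizons, consider the sketch below. The names are hypothetical, and treating each horizon's upper bound as inclusive is an assumption; the source does not say how exact boundary values were handled.

```python
from bisect import bisect_left

# Upper bounds of the 7 time horizons, in years from the baseline visit.
HORIZON_EDGES_YEARS = [0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
HORIZON_NAMES = [
    "0-3 months", "3-6 months", "6 months-1 year",
    "1-2 years", "2-3 years", "3-4 years", "4-5 years",
]


def horizon_for_interval(years_from_baseline):
    """Return the horizon containing the interval, or None if out of range.

    Assumption: upper bounds are inclusive, so exactly 0.25 years falls
    in "0-3 months".
    """
    if years_from_baseline < 0 or years_from_baseline > HORIZON_EDGES_YEARS[-1]:
        return None
    return HORIZON_NAMES[bisect_left(HORIZON_EDGES_YEARS, years_from_baseline)]


def surgery_label(years_to_surgery, horizon_name):
    """1 if surgery occurred inside the named horizon, else 0.

    Pass None for an eye that never had surgery.
    """
    if years_to_surgery is None:
        return 0
    return int(horizon_for_interval(years_to_surgery) == horizon_name)
```

Because a separate model is trained per horizon, each eye receives a binary label relative to one horizon rather than a single multi-class label.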

Preparing data for deep learning
For each time interval, the included eyes were randomly split into 60%, 20%, and 20% for training, validation, and testing. For the input, we spatially oriented the OCT RNFL-thickness data into a 12 × 12 grid to match the clock hour and quadrant values. Further, we also radially imputed the total deviation values from 24-2 Humphrey VFs to fill out a 12 × 12 grid. Then, the 3 images were stacked to form a 3-channel image for every eye, which was then fed into a vision transformer (ViT) 18 for feature extraction. Data augmentation techniques (random horizontal flip, zoom, rotation, and skew augmentation) were applied to spatially aligned VF and OCT images to reduce overfitting 19.
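The random 60/20/20 partition can be sketched with the standard library alone. The function name and seed are hypothetical, and assigning any rounding remainder to the test set is an assumption rather than something the source specifies.

```python
import random


def split_60_20_20(eye_ids, seed=0):
    """Shuffle eye identifiers and split them 60% / 20% / 20%.

    Integer arithmetic avoids floating-point rounding surprises; any
    remainder from the integer division ends up in the test set.
    """
    ids = list(eye_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 6 // 10
    n_val = len(ids) * 2 // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```

Splitting on identifiers rather than on feature rows keeps the partition reusable across the image and tabular inputs for the same eye.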

Deep learning model overview
In recent years, there has been notable progress in the development of attention-based DLMs 20,21. Attention-based DLMs have been successfully applied in the fields of glaucoma detection 22-24, fundus retinal vessel segmentation 25, and glaucoma progression forecasting 7. ViTs have recently emerged as a competitive alternative to convolutional neural networks (CNNs) in image processing. When pre-trained on large amounts of data and transferred to tasks with fewer datapoints, ViTs match or exceed the performance of state-of-the-art CNNs on image classification tasks while requiring fewer computational resources for training 18. ViTs can also be used as feature extractors. Previous research has shown that using ViTs as feature extractors may help deep learning models achieve better accuracy 26,27. Inspired by this previous research, we employed a ViT to integrate spatial information into the DLM for the prediction of glaucoma surgery outcomes. We used the DLM architecture depicted in Fig. 1 to predict the probability of glaucoma surgery within specific time horizons.
The spatially oriented three-channel VF and OCT images included 54 radial total deviation values from 24-2 Humphrey VFs, four quadrants of OCT RNFL thickness values, and 12 clock hour OCT RNFL thickness values. A ViT was then used to obtain a vector of the spatial features. These spatial representations of VF and OCT images were then concatenated with 6 VF features (False Positives, False Negatives, Fixation Losses, Test Duration, MD, PSD), 6 OCT features (Rim Area, Disc Area, Vertical Cup Disc Ratio, Cup Volume, Average RNFL Thickness, Signal Strength), 2 clinical features (visual acuity measurement, IOP), and 3 demographic features (age, gender, and race), and fed into a fully connected neural network to predict the probability of the occurrence of glaucoma surgery within the specified time horizon.
We compared AUC values of our DLMs to AUC values of logistic regression models and end-to-end fully connected neural network (NN) models that did not use a ViT. Statistical significance for AUC was assessed using the DeLong 28 test. Logistic regression and NN classifiers incorporated all available information as inputs: 60 VF measures (54 radial total deviation values and 6 global metrics), 22 OCT measures (4 quadrants of OCT RNFL thickness values, 12 clock hour OCT RNFL thickness values, and 6 global OCT metrics), 2 clinical features, and 3 demographic features. The outputs were the probability of glaucoma surgery within specific time horizons. To reduce the probability of overfitting, we used L1 (Lasso) 29 and L2 (Ridge) 30 regularization for logistic regression and early stopping with NN 31. L1 regularization introduces a penalty term in the objective function that sums the absolute value of the coefficients, whereas L2 regularization adds a penalty term that sums the square of the coefficients; in both cases, complexity is penalized, which reduces overfitting. The logistic regression parameters were fine-tuned using grid-search 32. This process evaluates the model's performance for various combinations of parameters and selects the optimal values.
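The two penalties described above can be written out explicitly. The notation is an assumption introduced here for illustration: $\ell(\beta)$ denotes the logistic log-likelihood, $\beta_1, \dots, \beta_p$ the coefficients, and $\lambda \ge 0$ the regularization strength that a grid search would tune.

```latex
\begin{align}
\hat{\beta}_{\mathrm{L1}} &= \arg\min_{\beta}\; \Bigl[ -\ell(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \Bigr], \\
\hat{\beta}_{\mathrm{L2}} &= \arg\min_{\beta}\; \Bigl[ -\ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \Bigr].
\end{align}
```

Larger values of $\lambda$ shrink the coefficients more strongly; the L1 form can drive some coefficients exactly to zero, whereas the L2 form shrinks them smoothly toward zero.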

Main outcome measures
DLM performance was measured on the 20% held-out test set using AUC and precision-recall curves (PRC). Sensitivity (recall), specificity, precision (positive predictive value), and F1 score (the harmonic mean of recall and precision) were also used as evaluation metrics. To convert the estimated probability of surgery into a binary prediction, we used the maximum value of Youden's Index (J), mathematically defined as J = sensitivity + specificity − 1 33, to select the optimal thresholds 34 for classification. If the predicted probability was greater than the classification threshold, the eye was predicted to be surgical, otherwise non-surgical. Youden's Index gives equal weight to false positives and false negatives. For clinical deployment, this threshold could be adjusted to meet clinician preferences. SHAP values were used to estimate feature importance both globally and locally (i.e., at the patient level). When multiple DLMs for different time horizons surpassed a predetermined decision threshold, the DLM for the shortest time interval was implemented. For instance, if an eye was identified as requiring surgery for uncontrolled glaucoma within the 0-0.25 year, 0.25-0.5 year, and 0.5-1 year timeframes, the 0-0.25 year time horizon would be selected as the prediction.
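The threshold selection and the shortest-horizon rule described above can be sketched in plain Python. The function names and toy inputs are hypothetical, candidate thresholds are assumed to be the observed probabilities, and ties in J are resolved in favor of the lowest threshold, which is an arbitrary choice.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    An eye is called surgical when its probability is strictly greater
    than the threshold, matching the rule stated in the text.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p <= t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def earliest_flagged_horizon(probs, thresholds, horizons):
    """Return the shortest horizon whose model exceeds its threshold.

    `horizons` must be ordered from shortest to longest; `probs` and
    `thresholds` are dicts keyed by horizon name. Returns None when no
    model flags the eye.
    """
    for name in horizons:
        if probs[name] > thresholds[name]:
            return name
    return None
```

Shifting the returned threshold up or down is how the trade-off between false positives and false negatives could be adapted to clinician preferences.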

Results
Key demographic, VF, OCT, and clinical characteristics of surgery and non-surgery eyes are summarized in Tables 1 and 2. Compared to non-surgery eyes in the same time horizon, surgery eyes were more likely to have higher IOP, higher PSD, longer test duration, lower MD, and lower RNFL thickness. The exception was in the 4-5 year time interval, where the median IOPs of surgical and non-surgical eyes were identical. The difference in IOP and in glaucoma severity, as measured by VF and OCT metrics, between surgery and non-surgery eyes was greatest in the 0-3 month time horizon. This difference tended to become smaller as the time horizon increased. ROC and PRC for separate DLM models are depicted in Fig. 2. The curves are color-coded in a rainbow pattern, with red representing 0-3 months (0-0.25 years) and violet representing 4-5 years. The DLM predicting surgery within 3 months had the best forecasting performance as well as the highest F1 and the highest precision.
AUC, sensitivity, specificity, precision, recall, and F1 are shown in Table 3. The DLM for the shortest time horizon of surgery (within 3 months) achieved an AUC of 0.92 (95% CI 0.88, 0.96), an F1 of 0.73, a sensitivity of 0.83, and a specificity of 0.82 for predicting glaucoma surgery. Predictive performance decreased as the time horizon increased.
The SHAP summary plot and SHAP feature importance plot for the 0-3 month DLM are shown in Fig. 3A and B, respectively. The y-axis represents the top 20 most important features sorted by their global impact, and the x-axis represents the Shapley value. Each dot on the summary plot (Fig. 3A) represents one predicted case. The color indicates the value of the feature's importance, from low (blue) to high (red). The higher the SHAP value of a feature, the more important the feature is to the surgical prediction. In the SHAP feature importance plot (Fig. 3B), bar lengths show the average impact of the individual features on the model's prediction. For the 0-3 month DLM, IOP is the most important feature, followed by MD and PSD. These features are similar to factors that a clinician may take into account when making the decision to proceed with surgery. The top 5 most important features calculated by Shapley 35 values for DLMs at the various time horizons are listed in Table 5. All 7 models placed IOP within the top 5 most important features. MD and average RNFL thickness are listed among the top 5 most important features by 6 of the 7 models.
Figure 4A shows a decision plot (local feature importance) for an eye that is predicted to need glaucoma surgery within 3 months, while Fig. 4B shows an eye that is predicted to not need surgery within 3 months. The x-axis at the top of the plot represents the eye's predicted probability for surgery. The y-axis lists the top 20 most important features that affect the eye-level prediction, in order of decreasing importance. The feature values of each eye are printed in the corresponding space. Moving from bottom to top in order of increasing importance, SHAP values of all features are added to the model's base value of 0.4 (the average of all predictions made by the DLM), arriving at the DLM's output of 0.63 for the eye in Fig. 4A and 0.09 for the eye in Fig. 4B. If a feature increases the probability of predicting surgery, the line moves to the right. If a feature increases the probability of a non-surgery prediction, the line moves to the left. The decision threshold of 0.6, selected by the maximum value of Youden's Index (J), was used to convert the probability of surgery into the final binary DLM prediction (at the top of the graph). In Fig. 4A, PSD, average RNFL thickness, and MD are three of the most influential features that increase the predicted surgery probability. In Fig. 4B, rim area, vertical cup disc ratio, and IOP are three of the most influential features that decrease surgery probability.
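The additive reading of a decision plot can be sketched as a running sum. The per-feature contributions below are placeholder numbers invented only so that they sum from the reported base value of 0.4 to the reported output of 0.63; they are not values from the study, and the function and feature names are hypothetical.

```python
def decision_path(base_value, shap_values):
    """Accumulate SHAP values from least to most influential feature.

    Returns a list of (feature_name, running_total) pairs; the last
    running total is the model output for that eye.
    """
    ordered = sorted(shap_values.items(), key=lambda item: abs(item[1]))
    path, running = [], base_value
    for name, value in ordered:
        running += value
        path.append((name, running))
    return path


# Placeholder contributions (NOT study data), chosen to sum to 0.23.
example = {"feature_a": 0.10, "feature_b": 0.08, "feature_c": 0.05}
path = decision_path(0.4, example)
final_probability = path[-1][1]
predicted_surgical = final_probability > 0.6  # threshold reported in the text
```

Positive contributions move the running total to the right and negative ones to the left, which is what the line in the decision plot traces.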

Discussion
In this study, we developed DLMs that were able to forecast future glaucoma surgery within 3 years with clinically useful AUC values using multimodal data (VF, OCT, and clinical information) from a single clinical encounter. Model performance steadily declined when forecasting surgery further into the future. SHAP values were used to estimate feature importance both globally and locally. The features that were most important in predicting the occurrence of surgery included high IOP and worse glaucoma severity as measured by VF and OCT testing, which is consistent with clinical decision making.
Although previous studies utilized machine learning for predicting glaucoma surgery, our model excels in early identification and demonstrates better AUC than previous models. Baxter et al. 12 developed a logistic regression model to predict surgical intervention within 6 months based on EHR data with an AUC of 0.67. Wang et al. 13 developed a DLM to predict glaucoma surgery within 120 days with an AUC of 0.73 based on structured and unstructured EHRs. Some predictive models for glaucoma progression used VF data with clinical information (e.g., IOP) in addition to OCT RNFL thickness 7,8, but require multiple follow-up visits to make predictions. Our DLMs achieved AUC values over 0.8 from a single baseline ophthalmology visit alone, potentially mitigating issues arising from poor adherence to recommended follow-up schedules.
Our DLMs also make surgical predictions for different time intervals, up to 5 years in the future. When forecasting further into the future, model performance decreased. This is likely because certain factors, such as high IOP and advanced glaucoma damage, are associated with an urgent need for surgery. If the need for surgery is less clear (e.g., borderline IOP, moderate glaucoma damage), clinicians may wait longer due to modest success rates and higher risks associated with these surgeries. For example, the rate of failure of trabeculectomy and tube shunts is approximately 10% per year 36. There is also a high risk of vision loss with traditional glaucoma surgery: at least 2% of patients experience long-term severe vision loss after surgery 37.
Another contribution is investigating feature importance using a locally interpretable model-agnostic framework. From SHAP feature importance analysis, lower MD, higher IOP, thinner average RNFL thickness, and higher PSD were the top 4 features that contributed to the DLM decision to predict surgery. These results are consistent with previous studies 38, which have demonstrated that higher IOP with more severe glaucoma (i.e., low MD, high PSD) is associated with an increased rate of progression of glaucomatous VF loss. However, beyond these easy-to-interpret features, it is likely that our ViT-based DLMs are using the spatial relationships between the VF and OCT data to predict the risk for surgery.
Our study has several strengths, including using a large multimodal real-world dataset to develop and test our models. We developed DLMs that can make predictions based on the baseline ophthalmology visit alone, which may address the problem caused by poor adherence to recommended follow-up. We also explored model performance for different time horizons, which may be important for patient triaging (e.g., an eye for which the model predicts surgery within 3 months is likely at higher risk than an eye for which it predicts surgery within 12 months).
Our work also has several limitations. First, the DLM was trained on a dataset of patients undergoing treatment at a tertiary care glaucoma center and may not be generalizable to other settings. Our definition of surgery for uncontrolled glaucoma was also based on the procedures most often performed by clinicians in this practice (trabeculectomy, tube shunt, diode, xen), and it is possible that clinicians who perform other types of procedures for uncontrolled disease (e.g., GATT) may have higher or lower thresholds for deciding to proceed with surgery, which may have an impact on model generalizability. Glaucoma surgery is also only a surrogate for glaucoma progression (i.e., having surgery does not necessarily mean the eye would have progressed without surgery). Additionally, other factors that are not captured in our data set, such as surgeon preference, patient refusal, or higher-than-normal surgical risk, may factor into the decision to pursue surgery. Finally, the multimodal data required by our model (particularly OCT and VF) may be difficult to obtain in resource-limited settings, which may limit the deployment of such models.
If future prospective and external validation studies show that our DLMs are generalizable, they could feasibly be deployed in clinical practice. For instance, surgery prediction software could be deployed in general ophthalmology or optometry offices to triage high-risk glaucoma patients who need a prompt referral to a glaucoma specialist for consideration of more aggressive management. Such prediction software can not only triage patients but also alert clinicians to potential high-risk patients who might otherwise be overlooked due to human error. However, a notable consideration in the application of AI in the medical field is the possibility that future models could predominantly learn from the behavior of implemented AI systems rather than from the expertise of human surgeons. Further research will be needed to mitigate this issue.
In the future, we endeavor to incorporate patients' medication and surgical history data to enhance model performance. Additionally, we intend to conduct a user study involving comprehensive eye care providers who often make surgical referrals to glaucoma specialists. This study aims to gain a deeper understanding of their needs regarding surgical intervention prediction. The goal is to refine both the DLM and its interpretability, ultimately enhancing its effectiveness for clinical practice.
In conclusion, we developed DLMs that predict eyes at high risk for future surgery using multimodal data from an initial visit. The DLMs achieved clinically useful AUC values (> 0.8) for all models that predicted the occurrence of surgery within 3 years. Implementing such prediction models in a clinical setting can help stratify high- and low-risk patients early in the disease course, facilitating prompt referral to a glaucoma specialist for surgical management. https://doi.org/10.1038/s41598-023-50597-0

Figure 1. Schematic of our deep learning model. Data augmentation techniques (random horizontal flip, zoom, rotation, and skew augmentation) were first applied to the VF-OCT stack. Then, spatially aligned VF and OCT images were input into the Vision Transformer (ViT). ViT-extracted features were then concatenated with VF, OCT, clinical, and demographic data, and fed into a fully connected classifier to predict the occurrence of glaucoma surgery within the specified time horizon. This ViT architecture was described by Dosovitskiy et al.

Figure 2. ROC and PRC for DLMs in different time intervals. The curves are color-coded in a rainbow pattern. (A) Receiver operating characteristic curves and (B) precision-recall curves for the 7 different DLMs for different time horizons.

Figure 3. Feature importance for the within-3-months DLM, listed in decreasing order. (A) Each point on the summary plot is a Shapley value for a feature from a single prediction. Red dots increase the probability of a surgery prediction, whereas blue dots increase the probability of a non-surgery prediction. (B) Mean absolute Shapley values. IOP, MD, and PSD are the top three most important features.

Figure 4. Decision plot visualizing model decisions using cumulative SHAP values. Moving from bottom to top, SHAP values of all features are added to the model's base value. Each prediction starts from the bottom of the plot at the model's base value of 0.4 (probability) and hits the x-axis at 0.63 for the eye in (A) and 0.09 for the eye in (B). (A) One eye predicted to need glaucoma surgery within 3 months. (B) One eye predicted to not need surgery within 3 months.

Table 1. Baseline demographics and clinical characteristics of surgery and non-surgery eyes for different time horizons. IQR interquartile range, IOP intraocular pressure.

Table 2. Baseline key VF and OCT characteristics of surgery and non-surgery eyes for different time horizons. IQR interquartile range, RNFL retinal nerve fiber layer, MD mean deviation, PSD pattern standard deviation.

Table 3. Diagnostic accuracy of DLM performance in identifying eyes at risk of surgery for uncontrolled glaucoma.

Table 4. Performance metrics for different models in identifying eyes at risk of surgery for uncontrolled glaucoma. AUCs were compared between models using the DeLong test to determine whether performance differences were statistically significant (p < 0.05). *p < 0.05 when comparing the model AUC to the DLM at the same time horizon.

Table 4 column headers: Time horizon (years); Logistic regression AUC (95% CI); Neural network AUC (95% CI); DLM AUC (95% CI).

Across the models summarized in Table 5, MD and average RNFL thickness rank among the top 5 most important features in 6 of the 7 models, and PSD ranks among the top 3 most important features in 5 of the 7 models.

Table 5. Top 5 most important features calculated by SHAP value for models at the various time horizons, listed in decreasing order.